| Hector R. Gavilanes | Chief Information Officer |
| Gail Han | Chief Operating Officer |
| Michael T. Mezzano | Chief Technology Officer |
University of West Florida
November 2023
The prcomp() function performs principal component analysis by applying singular value decomposition to the centered (and optionally scaled) data matrix, which is numerically preferable to eigendecomposition of the covariance matrix (the approach used by princomp()).
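As a minimal illustration of the call, assuming the built-in USArrests data set rather than the project's data:

```r
# Sketch: PCA via SVD of the centered and scaled data matrix.
pca_fit <- prcomp(USArrests, center = TRUE, scale. = TRUE)
summary(pca_fit)   # proportion of variance explained per component
head(pca_fit$x)    # principal component scores for each observation
```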
- Driven by multicollinearity.
- Features less significant in explaining variability.
- All variables are numeric.
- Categorical Index variable.
- 34 missing values.
Imputation of missing values using the mean (\(\mu\)).
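A sketch of mean imputation, assuming a placeholder data frame `df` (not the project's object names): each NA in a numeric column is replaced by that column's mean.

```r
# Replace NAs in a numeric vector with the vector's mean
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply to every numeric column of a data frame 'df' (placeholder name)
df[] <- lapply(df, function(col) {
  if (is.numeric(col)) impute_mean(col) else col
})
```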
Mean (\(\mu = 0\)); standard deviation (\(\sigma = 1\))
\[ Z = \frac{x - \mu}{\sigma} \]
\[ Z \sim N(0,1) \]
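The standardization above can be sketched on a toy vector (values chosen for illustration only); the result is equivalent to R's built-in scale():

```r
# Z-score standardization: subtract the mean, divide by the standard
# deviation, so the result has mean 0 and standard deviation 1.
x <- c(2, 4, 6, 8)
z <- (x - mean(x)) / sd(x)   # same as as.numeric(scale(x))
mean(z)   # 0 (up to floating point)
sd(z)     # 1
```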
- 3 outliers.
- No high-leverage points.
- Minimal difference with or without them.
- No observations removed.



library(caTools)  # sample.split()
library(caret)    # preProcess()

# Reproducible random sampling
set.seed(my_seed)

# Create target y-variable for the training set
y <- train_data$expected_survival

# Split the data into training (70%) and test (30%) sets
split <- sample.split(y, SplitRatio = 0.7)
training_set <- subset(train_data, split == TRUE)
test_set <- subset(train_data, split == FALSE)

# Fit PCA preprocessing on the training predictors only
pca <- preProcess(training_set[, -target_index],
                  method = 'pca', pcaComp = 8)

# Apply the PCA transformation to the training set
training_set <- predict(pca, training_set)
# Reorder columns, moving the dependent feature to the end
training_set <- training_set[c(2:9, 1)]

# Apply the same PCA transformation to the test set
test_set <- predict(pca, test_set)
# Reorder columns, moving the dependent feature to the end
test_set <- test_set[c(2:9, 1)]
8 Principal Components
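One way to sanity-check the choice of eight components is to look at the cumulative proportion of variance explained. The sketch below assumes a placeholder `training_set_raw` holding the untransformed training predictors and the same `target_index` as above:

```r
# Cumulative proportion of variance explained by each component,
# computed with prcomp() on the standardized training predictors.
pc <- prcomp(training_set_raw[, -target_index],
             center = TRUE, scale. = TRUE)
cumsum(pc$sdev^2) / sum(pc$sdev^2)
```

Retaining components up to the point where the cumulative proportion plateaus is a common rule of thumb; here the deck fixes the count at eight via pcaComp = 8.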
PCA is definitely a useful tool to have in your toolkit!